Minimally-Supervised Attribute Fusion for Data Lakes

Authors

  • Karamjit Singh
  • Garima Gupta
  • Gautam Shroff
  • Puneet Agarwal
Abstract

Aggregate analysis, such as comparing country-wise sales versus global market share across product categories, is often complicated by the unavailability of common join attributes, e.g., category, across diverse datasets from different geographies or retail chains, even after disparate data is technically ingested into a common data lake. Sometimes this is a missing data issue, while in other cases it may be inherent, e.g., the records in different geographical databases may actually describe different product ‘SKUs’, or follow different norms for categorization. Record linkage techniques, such as [3], can be used to automatically map products in different data sources to a common set of global attributes, thereby enabling federated aggregation joins to be performed. Traditional record-linkage techniques are typically unsupervised, relying on textual similarity features across attributes to estimate matches. In this paper, we present an ensemble model combining minimal supervision using Bayesian network models together with unsupervised textual matching for automating such ‘attribute fusion’. We present results of our approach on a large volume of real-life data from a market-research scenario and compare with a standard record matching algorithm. Finally, we illustrate how attribute fusion using machine learning could be included as a data-lake management feature, especially as our approach also provides confidence values for matches, enabling human intervention, if required.
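
The ensemble idea in the abstract can be illustrated with a small sketch. This is not the authors' model: a naive Bayes classifier over a single discrete local attribute stands in for the Bayesian network, token-level Jaccard similarity stands in for the textual matcher, and the two are combined by a weighted average. All names (`jaccard`, `NaiveBayes`, `fuse`, `weight`) and the sample records are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def normalize(scores: dict) -> dict:
    """Scale non-negative scores to sum to 1 (uniform if all are zero)."""
    total = sum(scores.values())
    if total == 0:
        return {k: 1.0 / len(scores) for k in scores}
    return {k: v / total for k, v in scores.items()}


class NaiveBayes:
    """Naive Bayes over one discrete local attribute (assumed stand-in
    for the paper's Bayesian network), with add-one smoothing."""

    def __init__(self, labelled):
        # labelled: list of (local_attribute_value, global_category) pairs
        self.prior = Counter(cat for _, cat in labelled)
        self.cond = defaultdict(Counter)
        for value, cat in labelled:
            self.cond[cat][value] += 1
        self.values = {value for value, _ in labelled}

    def posterior(self, value):
        """Return P(category | value) for every category seen in training."""
        n = sum(self.prior.values())
        scores = {}
        for cat, count in self.prior.items():
            # Add-one smoothing over the seen values plus one "unseen" bucket.
            likelihood = (self.cond[cat][value] + 1) / (count + len(self.values) + 1)
            scores[cat] = (count / n) * likelihood
        return normalize(scores)


def fuse(description, local_value, model, categories, weight=0.5):
    """Return (best_category, confidence) from a weighted average of the
    supervised posterior and the normalized text-similarity scores."""
    text = normalize({c: jaccard(description, c) for c in categories})
    sup = model.posterior(local_value)
    combined = {c: weight * sup.get(c, 0.0) + (1 - weight) * text[c]
                for c in categories}
    best = max(combined, key=combined.get)
    return best, combined[best]


# Hypothetical data: a handful of labelled records is the "minimal supervision".
labelled = [("BEV-01", "soft drinks"), ("BEV-02", "soft drinks"),
            ("SNK-01", "salty snacks")]
categories = ["soft drinks", "salty snacks"]
model = NaiveBayes(labelled)
category, confidence = fuse("cola soft drink 330ml", "BEV-01", model, categories)
print(category, round(confidence, 3))
```

Because both component scores are normalized before averaging, the combined score of the winning category can be read as a rough confidence value; records whose confidence falls below some chosen threshold could be routed to a human reviewer, in the spirit of the human intervention the abstract mentions.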


Related resources

Role of Minimally Invasive Spine Surgery in Adults with Degenerative Lumbar Scoliosis: A Narrative Review

Background and Aim: Degenerative lumbar scoliosis is a spinal deformity resulting from advanced disc degeneration and facet arthropathy. Given the inconclusive available literature and lack of high-quality data supporting the role of minimally invasive surgical management of degenerative lumbar scoliosis, this review intends to highlight and compare the various viable minimally invasive surgica...


Comparing minimally supervised home-based and closely supervised gym-based exercise programs in weight reduction and insulin resistance after bariatric surgery: A randomized clinical trial

Background: Effectiveness of various exercise protocols in weight reduction after bariatric surgery has not been sufficiently explored in the literature. Thus, in the present study, we aimed at comparing the effect of minimally supervised home-based and closely supervised gym-based exercise programs on weight reduction and insulin resistance after bariatric surgery. ...


Weakly-supervised Learning of Mid-level Features for Pedestrian Attribute Recognition and Localization

State-of-the-art methods treat pedestrian attribute recognition as a multi-label image classification problem. The location information of person attributes is usually eliminated or simply encoded in the rigid splitting of whole body in previous work. In this paper, we formulate the task in a weakly-supervised attribute localization framework. Based on GoogLeNet, firstly, a set of mid-level att...


Minimally Supervised Japanese Named Entity Recognition: Resources and Evaluation

Approaches to named entity recognition that rely on hand-crafted rules and/or supervised learning techniques have limitations in terms of their portability into new domains as well as in the robustness over time. For the purpose of overcoming those limitations, this paper evaluates named entity chunking and classification techniques in Japanese named entity recognition in the context of minimall...


Detecting Surface Waters Using Data Fusion of Optical and Radar Remote Sensing Sensor

Identification and monitoring of surface water using remote sensing have become very important in recent decades due to its importance in human needs and political decisions. Therefore, surface water has been studied using remote sensing systems and Sentinel-1 and Sentinel-2 sensors in this study. In this paper, two data fusion approaches and decision fusion improve the accuracy of surface wate...



Journal:
  • CoRR

Volume: abs/1701.01094

Publication date: 2017